Identification of differences in gene expression implicated in the Adherent-Invasive Escherichia coli phenotype during in vitro infection of intestinal epithelial cells

Introduction Adherent-invasive Escherichia coli (AIEC) is strongly associated with the pathogenesis of Crohn’s disease (CD). However, no molecular markers currently exist for AIEC identification. This study aimed to identify differentially expressed genes (DEGs) between AIEC and non-AIEC strains that may contribute to AIEC pathogenicity and to evaluate their utility as molecular markers. Methods Comparative transcriptomics was performed on two closely related AIEC/non-AIEC strain pairs during Intestine-407 cell infection. DEGs were quantified by RT-qPCR in the same RNA extracts, as well as in 14 AIEC and 23 non-AIEC strains to validate the results across a diverse strain collection. Binary logistical regression was performed to identify DEGs whose quantification could be used as AIEC biomarkers. Results Comparative transcriptomics revealed 67 differences in expression between the two phenotypes in the strain pairs, 50 of which (81.97%) were corroborated by RT-qPCR. When explored in the whole strain collection, 29 DEGs were differentially expressed between AIEC and non-AIEC phenotypes (p-value < 0.042), and 42 genes between the supernatant fraction of infected cell cultures and the cellular fraction containing adhered and intracellular bacteria (p-value < 0.049). Notably, six DEGs detected in the strain collection were implicated in arginine biosynthesis and five in colanic acid synthesis. Furthermore, two biomarkers based on wzb and cueR gene expression were proposed with an accuracy of ≥ 85% in our strain collection. Discussion This is the first transcriptomic study conducted using AIEC-infected cell cultures. We have identified several genes that may be involved in AIEC pathogenicity, two of which are putative biomarkers for identification.

Currently, the genetic elements and mechanisms underlying the AIEC phenotype are not fully understood, and no molecular markers exist for its identification.Detection still relies on phenotypic assays on infected cell cultures, making it time-consuming and poorly standardized.Previous studies using molecular methods to identify genes associated with AIEC and widely distributed within the pathotype were unsuccessful (Dogan et al., 2014;Vazeille et al., 2016;Ceśpedes et al., 2017).Comparative genomics studies failed to identify any definitive molecular signatures for AIEC, although several putative biomarkers were proposed, with accuracies ranging from 56% to 85% (Deshpande et al., 2015;Zhang et al., 2015;Desilets et al., 2016;O'Brien et al., 2017;Camprubı-Font et al., 2018).Further analyses are required to determine the utility of these biomarkers across diverse E. coli strains from different geographical regions and phylogenetic backgrounds.
Previously, we proposed using three single nucleotide polymorphisms (SNPs) that, when combined, achieved an accuracy of 81% for AIEC detection (Camprubı-Font et al., 2018).However, further investigation revealed that their usefulness was limited to AIEC clones circulating in Spain and not for geographically distant strains (Camprubı-Font et al., 2020).Therefore, there remains a need to identify validated biomarkers for the molecular detection of AIEC.
In a previous study, we identified AIEC/non-AIEC strain pairs with an identical pulsotype but differing ability to invade IECs (Martinez-Medina et al., 2009).We hypothesized that differential gene expression could be responsible for the AIEC phenotype.To date, three comparative transcriptomic studies have been conducted using AIEC strains.Zhang et al. found six genes over-expressed in the AIEC LF82 strain compared to the non-pathogenic strain HS while growing in Luria-Bertani (LB), and these genes were related to bacteriophage infection and inorganic ion transport and metabolism (Zhang et al., 2015).Delmas et al. compared AIEC LF82 and non-pathogenic E. coli strain K-12 gene expression in the presence and absence of bile salts.The results showed that bile salts induced over-expression of six genes involved in ethanolamine utilization in LF82 but not K-12 (Delmas et al., 2019).Recently, Elhenawy et al. used the AIEC NRG875c strain as a model to evaluate differences in gene expression between in vivo and in vitro conditions.They identified the type IV secretion system as a fundamental virulence factor for in vivo survival (Elhenawy et al., 2021).
Although these studies have provided valuable information about AIEC pathophysiology, further research is needed to fully understand the molecular mechanisms underlying AIEC pathogenicity and to identify AIEC-specific traits widely distributed among this phenotype.Moreover, none of these studies assessed the suitability of gene expression differences as biomarkers for the molecular identification of the pathotype.In this study, we sequenced the genes expressed during IEC infection for the first time, the model used to differentiate between invasive and non-invasive E. coli strains in vitro, to identify the genes expressed during adhesion, invasion, and intracellular survival.We also used two genetically close AIEC/non-AIEC strain pairs as models for comparison, which narrowed the differences that could be found between strains and increased the chance of identifying genetic features associated with the AIEC phenotype.Finally, differentially expressed genes were further investigated in a genetically diverse strain collection of AIEC and non-AIEC strains to identify mechanisms widely used in a phylogenetically diverse AIEC collection that could serve as biomarkers.

Comparative transcriptomics between AIEC/non-AIEC strain pairs
RNA-seq was conducted on bacterial mRNA purified extracts from the supernatant fractions of infected IECs (SN fraction) and the eukaryotic cell fractions containing adhered and invading bacteria (A/I fraction) of two AIEC/non-AIEC strain pairs.The AIEC07/ECG04 strain pair was obtained from the ileum of a control subject and the AIEC17/ECG28 strain pair was isolated from the colon of a CD patient in a previous study (Martinez-Medina et al., 2009).Sequences were trimmed, and between 16.0 and 41.9 million reads were obtained for SN fraction samples, while A/I samples ranged from 315.6 to 379.6 million reads.The percentage of reads that mapped to the AIEC UM146 reference genome ranged from 23.2 to 80.2% in SN samples and from 1.5 to 8.0% in A/I samples (Supplementary Table 1).Nonetheless, the coverage of the bacterial genome was similar in the SN (73.1 ± 14.8%) and A/I (85.4 ± 3.7%) fractions.
Comparative transcriptomics between the two AIEC strains and their non-AIEC counterparts revealed a total of 67 differences in gene expression, of which 48 were under-expressed in AIEC (25 in the A/I fraction and 23 in the SN fraction) and 19 were overexpressed (12 in the A/I fraction and 7 in the SN fraction) (Figure 1).These 67 differences in gene expression corresponded to 62 genes (i.e., differentially expressed genes, DEGs) because some DEGs were differentially expressed in more than one comparison.
Regarding the gene expression levels, total fragments per kilobase of transcript per million mapped reads (FPKM) values were lower than 1000 in 73.13% of the studied DEGs.A group of 13 genes had higher total FPKM values, of which 12 were classified as tRNA in the functional categories, and one was classified as sRNA (Figure 1; Supplementary Table 2).

Functional categories of the differentially expressed genes
Functional analysis was conducted to classify the DEGs into eight functional categories (Figure 2).The detailed list of the functions of each gene is specified in Supplementary Table 2. Overall, the most abundant category included genes related to metabolic processes (19 genes), followed by tRNAs (13 genes) Predicted function of the differentially expressed genes (DEGs) in each comparison distributed in eight functional categories.The number of DEGs in each category is shown for the SN and A/I fractions of the two strain pairs.Genes that presented an over-expression in AIEC are shown in black, while genes that showed an under-expression are shown in grey.The DEGs of this figure are listed in Supplementary Table   and genes of unknown function (10 genes).Genes related to adhesion (5 genes), regulatory functions (5 genes), transport (4 genes), and cell division (2 genes) were also identified.The remaining nine genes were included in the "others" category, which comprised genes related to antibiotic resistance, protein degradation, stress response, iron processing, integrases, sRNA, and rRNA.

Confirmation of comparative transcriptomics results using RT-qPCR
The same samples used for RNA-seq were utilized to validate the comparative transcriptomics results by means of RT-qPCR.A total of 61 differences in gene expression were analyzed because data was not obtained for six genes due to incorrect quantification, with values out of range of Fluidigm.
Thirty-three out of the 42 genetic AIEC under-expressions (78.57%) and 17 of the 19 genetic over-expressions (89.47%) detected in RNA-seq were confirmed by RT-qPCR as underand over-expressions, respectively (Figure 3).Moreover, a positive correlation was observed between log 2 fold change values obtained from comparative transcriptomics and log 2 RTA (relative transcript abundance) values obtained from RT-qPCR (r = 0.677, p-value<0.001)(Figure 3).Therefore, these results confirm the reliability of RNA-seq data, and subsequent quantifications of gene expression were performed by RT-qPCR.
Non-purified samples of the strain pairs were also analyzed by RT-qPCR.These samples were obtained before the purification steps of the MICROBEnrich kit and the Ribo-Zero Magnetic Gold kit (epidemiology).The differences in gene expression between purified and non-purified samples were investigated.Gene expression differences were quantified by RT-qPCR in purified and non-purified RNA extracts from the AIEC/non-AIEC strain pairs.The log 2 RTA values obtained from the two types of samples were compared to determine the effects of purification.
Differences in log 2 RTA values were found between purified and non-purified samples of AIEC SN, non-AIEC SN, and AIEC A/I (pvalue ≤ 0.043), whereas no statistically significant differences were found in non-AIEC A/I samples (Table 1).Overall, purified samples showed higher Ct values in the RT-qPCR, indicating a lower number of copies of the transcript.Nonetheless, 34 out of the 59 differences of gene expression studied (57.63%) showed a concordance of over or under-expression between purified and non-purified samples (Supplementary Table 3).

Expression levels of confirmed DEGs in the strain collection
The study of 62 DEGs, identified using RNA-seq and comparative transcriptomics, was extended to a larger collection of 14 AIEC and 23 non-AIEC strains to evaluate whether the differences in expression were widespread among AIEC strains and specific to this group.All the information Correlation between log2 fold change values obtained in comparative transcriptomics and log2 RTA values obtained by RT-qPCR (r = 0.677, p<0.001).Amplification of the DEGs with RT-qPCR was done in the same samples that were sequenced.Genes located in quadrant I revealed an under-expression in AIEC in both techniques, whereas the genes in quadrant III revealed an over-expression (n = 50).Quadrants II and IV (in grey) contain genes with discrepancies between the two techniques (n = 11).related to the strains background, adhesion-invasion index, and intramacrophage survival capacities can be found in Supplementary Table 6.Additionally, 5 genes of interest based on bibliographic research (fimH, chiA, adiA, adiC, and potE) (Low et al., 2013;Charlier and Bervoets, 2019;Chervy et al., 2020), and 10 genes encoded in genomic fragments where other DEGs were identified were included in the analysis.However, five of the 62 DEGs (0090, 0802, 1503, 2635, and 2679) were excluded from the analysis in the strain collection due to incorrect amplification related to high or low concentrations of the gene.
Overall, 72 gene quantifications obtained by RT-qPCR were studied in the strain collection.
Since differences in gene expression can only be compared for samples in which the genes have been amplified, a preliminary analysis was performed to study the frequency of gene amplification in AIEC and non-AIEC samples.Non-amplification can result from the absence of a gene in a given strain or an expression level below the detection limit.Eight genes (0413, 0415, 0420, 0422, 2902 and 2903, 0410f, and 0410e) were amplified more frequently in AIEC strains than in non-AIEC strains (p-value ≤ 0.047) (Figure 4).Despite the high percentage  The mean log2 RTA values of 59/67 gene expression differences of purified and non-purified samples are shown for each phenotype (AIEC and non-AIEC), and a fraction (SN and A/I) since 8/67 of these differences did not show amplification values in the RT-qPCR.A paired t-test was applied in each comparison, and a p-value was given.
of amplification of these genes in AIEC strains (71-100%), they were not AIEC-specific, as these genes were also detected in a high percentage of non-AIEC strains (44-85.7%).Four genes are involved in the colanic acid biosynthesis pathway, one is a probable serine/threonine kinase, and the remaining three genes have unknown functions.
A concordance was found between the differential gene expression data obtained by RNA-seq and RT-qPCR data obtained from the entire strain collection (Supplementary Table 3).In this case, 62 out of 67 differences in gene expression were analyzed, and five were excluded because data were unavailable.Specifically, 17 of the 19 (89.47%) over-expressions and 15 of the 43 (34.88%)underexpressions found in RNA-seq were also over-expressed and underexpressed, respectively, in the non-purified samples of the strain collection studied by RT-qPCR.Thus, an overall validation of 51.61% of the differential expressions was achieved (Figure 5).
Statistically significant differences (p-value ≤ 0.042) between AIEC and non-AIEC strains were found for 29 genes (40.27%).Most of them (N = 26) were detected in the SN fraction, one in A/I (1184), and two (0511 and 1198) were found in both fractions (Table 2; Supplementary Figure 1).Gene expression was also compared between SN and A/I fractions within each strain group.Statistically significant differences (p-value ≤ 0.049) were found in 42 genes (58.33%); two of them were detected in AIEC, 31 in non-AIEC, and nine in both phenotypes (Table 3; Supplementary Figure 1).Most were over-expressed in the A/I fraction (N = 37), while only five genes were over-expressed in the SN fraction.The nine DEGs detected in both phenotypes were over-expressed in the same fraction in AIEC and non-AIEC (1198, 0036, 2857, 1184, 1063, 1196, 0512, 2009, and 0414d) (Supplementary Table 4).

Function of the DEGs validated in the strain collection
Six DEGs identified in the strain collection were related to the arginine pathways (3046, 3046b, 1815, 1255, 1256, and 1257, corresponding to the genes argI, argF, tRNA-Arg, argC, argB, and argH, respectively) (Supplementary Table 2).All of them were over-expressed in AIEC strains in the SN fraction compared to non-AIEC strains (p-value ≤ 0.042).Since differential expression in several genes related to the arginine pathway was statistically significant, three genes related to this amino acid (adiA, adiC, and potE) were studied in order to determine if this differential expression could be related with acid resistance.The genes adiA and adiC belong to the argininedependent acid resistance system, responsible of the arginine decarboxylation to agmatine and the pH-dependent arginine: agmatine transport.The gene potE codifies for the ornithine: putrescine antiporter that belongs to the ornithine-dependent acid resistance system (Charlier and Bervoets, 2019).Similar results were obtained for these genes, with all of them overexpressed in AIEC strains compared to non-AIEC in the SN condition, but in this case, the differences were not statistically significant (Supplementary Table 4).
Moreover, the genes 0036 (tdcA) and 3172b (yjjB) were overexpressed in AIEC strains in SN compared to non-AIEC strains.The tdcA gene encodes for a transcriptional regulator that activates the tdc operon and is related to serine degradation, while yjjB is a putative serine/threonine exporter.
Furthermore, five genes presented functions related to colanic acid biosynthesis and export (0413, 0414, 0418, 0420, and 0415 corresponding to wza, wzb, wcaE, gmd, and wzc, respectively).In non-AIEC strains, wzc, wcaE, and gmd were over-expressed at the A/I fraction compared to the SN, while wzb was over-expressed in AIEC SN compared to AIEC A/I and non-AIEC SN.However, no significant differences were observed, and more strains could be included to verify these findings.
Three genes were related to type 1 fimbriae (1395, 1396, 1397 corresponding to fimC precursor, fimD, and fimF precursor, respectively), all under-expressed in the SN fraction compared to A/I in non-AIEC strains.Furthermore, fimH was included in this analysis, and an increased expression in AIEC SN was observed compared to non-AIEC SN, with a mean of log 2 RTA of 0.49 and -2.13, respectively (p-value = 0.019).
Moreover, three genes were associated with sorbitol biosynthesis (1689, 1690, and 1688 corresponding to srlB, srlAE, and srlD, respectively).They all presented statistically significant differences (p-value ≤ 0.0289 between SN and A/I fractions in non-AIEC strains. Finally, chiA was also detected as a DEG.This gene encodes for a chitinase protein in the bacterial membrane and promotes the adhesion of the AIEC strains in IECs (Low et al., 2013).A significant increase in expression was found in non-AIEC A/I compared to non-AIEC SN (p-value = 0.017).Additionally, an over-expression of chiA was observed in AIEC in both SN and A/I conditions compared to non-AIEC strains, although no statistically significant differences were detected.

Assessment of DEGs as biomarkers for AIEC identification
The applicability of DEGs as molecular markers for AIEC identification was assessed using binary logistic regression for  each gene.The data analysis of SN samples revealed a list of 19 genes, each capable of distinguishing between AIEC and non-AIEC strains with an accuracy ≥ 77.1% (Table 4; Supplementary Table 5).Notably, in three cases, the quantification of a single gene reached an accuracy of 85.7% (Figure 6).Two of these genes are related to colanic acid biosynthesis (0414 and 0420), and the other ( 2505) is a transcriptional regulator cooper-dependent.However, the 0420 gene was dismissed as a candidate since a high statistical dispersion was observed (Figure 6), and no statistically significant differences in gene expression were observed between AIEC and non-AIEC strains when the t-test was performed in the strain collection.In the analysis executed with log 2 RTA obtained in A/I samples, no molecular markers were found to distinguish between the two phenotypes.

Discussion
The AIEC pathotype is considered important in the inflamed gut of patients with CD, but its pathogenic mechanisms are not fully understood, and there are currently no molecular markers for its identification.This study aimed to identify genes involved in AIEC adhesion, invasion, and intracellular replication that could be used as markers for the molecular detection of the pathotype.Comparative transcriptomics was performed between two AIEC/ non-AIEC strain pairs with identical pulsed-field gel electrophoresis (PFGE) patterns, reducing the probability of detecting DEGs inherent to the phylogenetic origin of the strains and increasing the likelihood of detecting genes specifically related to the AIEC phenotype.Gene expression was analyzed during IEC infection to ensure that genes involved in adhesion, invasion, and intracellular replication were expressed.RT-qPCR was used to quantify the expression level of DEGs in a larger collection of AIEC and non-AIEC strains from different phylogroups to validate the results.This approach identified genes associated with the AIEC phenotype and putative molecular markers that could be used for AIEC screening.

Differential gene expression in AIEC during intestinal epithelial cells infection
In the strain collection, 29 genes were differentially expressed between AIEC and non-AIEC strains.Surprisingly, most of these genes (28 out of 29) were found in the SN fraction before adhesion or invasion occurred.This unexpected result may be because the proportion of non-AIEC bacteria adhered to the IECs may have a similar gene expression pattern to AIEC adhered to IECs, and adhered bacteria are much more abundant than intracellular bacteria in these samples.However, the 42 genes differentially expressed between the SN and A/I fractions in the strain collection may be a source of new virulence factors involved in cell adhesion and invasion.Interestingly, 40 of these 42 genes were detected in non-AIEC strains, while only 11 were detected as differentially expressed in AIEC strains.This unexpected result may be explained by the fact that these virulence factors are constitutive genes or present some basal expression in AIEC strains, while in non-AIEC strains, they are only expressed after entering into contact with IECs.
The study of genes related to L-arginine biosynthesis caught our attention, particularly the genes argI, argF, argC, argB, and argH, which are regulated by ArgR, a repressor that acts in the presence of arginine.In the strain collection, these genes were over-expressed in AIEC in the SN fraction.When analyzed individually, these genes were over-expressed in 77.8%, 54.5%, 90.0%, 90.0%, and 100.0% of the AIEC strains, respectively.Interestingly, in a study by Elhenawy et al., in vitro and in vivo gene expression of the AIEC strain NRG857c was compared by comparative transcriptomics, and it was shown that genes involved in the arginine biosynthesis pathway are upregulated in the intestinal environment of animal models.Additionally, these genes are involved in AIEC survival in vivo and are essential for robust gut colonization, as mutants of argBCGH genes were significantly depleted from the murine intestine (Elhenawy et al., 2021).
On the other hand, statistically significant differences were observed in the gene tdcA, which was over-expressed in AIEC SN compared with non-AIEC SN and AIEC A/I.This gene is a transcriptional activator of the tdc operon, responsible for Lserine degradation.A previous study showed that several conditions associated with the inflamed gut, such as glucose starvation and oxygen and nitric oxide, lead to an up-regulation of these genes (Kitamoto et al., 2019).
In addition, AIEC strains have been observed to under-express genes involved in the biosynthesis of colanic acid building blocks (wza, wzb, wcaE, gmd, and wzc) in the A/I fraction compared to non-AIEC strains.Colanic acid is involved in developing mature biofilms but not in bacterial adhesion to different biotic and abiotic surfaces (Zhang and Poh, 2018;Navasa et al., 2019).The capsular polysaccharide colanic acid inhibits binding with several inert surfaces (Hanna et al., 2003).Therefore, the down-regulation of these genes observed in AIEC strains during IEC interaction may lead to increased bacterial adherence.Moreover, we observed that AIEC strains over-express these genes in the SN fraction compared to non-AIEC strains, although only wza and wzb showed statistically significant differences.Consistent with this, it has been demonstrated that colanic acid protects bacteria against complement-mediated killing and enhances E. coli motility (Lippolis et al., 2018).
The genes belonging to the fim operon (fimC precursor, fimD precursor, fimF, and fimH) were over-expressed in AIEC compared to non-AIEC strains in both fractions, although statistically significant differences were only observed in the SN fraction.When comparing the SN and A/I fractions, similar gene expressions were observed in AIEC strains, while in non-AIEC Boxplot of the genes that could predict AIEC phenotype with an accuracy >85% by binary logistic regression.Two (0414 and 2505) were selected as putative biomarkers, and 0420 was dismissed due to high dispersion.Log2 RTA values are represented for all the samples used in this study, classified by phenotype.
strains, an over-expression in the A/I fraction was seen.It is known that the expression of the fim operon is a phase variable and depends on the orientation of an invertible element that contains the promoter.Consistent with our findings, it has been observed that non-AIEC strains mainly present the OFF phase, avoiding the expression of the fim operon (Dreux et al., 2013).The fim operon encodes for the type 1 fimbriae, a well-known virulence factor for AIEC that interacts with the antigen-related cell adhesion molecule 6 (CEACAM6) located on the apical surface of epithelial cells.CEACAM6 expression is upregulated in CD patients, leading to increased interaction (Barnich and Darfeuille-Michaud, 2010).Furthermore, the over-expression of fimH observed in AIEC strains in this study may contribute to these interactions.
Another gene related to adhesion that was over-expressed in AIEC, especially in the SN fraction, was chiA.Furthermore, when comparing the SN and A/I fractions, AIEC strains presented similar expressions, while non-AIEC strains had higher expressions in the A/I fraction.This gene codes for a chitinase, an enzyme that hydrolyzes a long-chain polymer of N-acetylglucosamine called chitin.This protein interacts with human chitinases, which are, in turn, over-expressed in IECs during intestinal inflammation, promoting the adhesion of bacteria (Mizoguchi, 2006;Chen et al., 2011).Low et al. detected a specific interaction between a chitinbinding domain of bacterial ChiA and the N-glycosylation of asparagine 68 residue in mouse Chitinase 3-like-1 (Low et al., 2013).Furthermore, five polymorphisms were observed in a chitin-binding domain of LF82 that were not present in K-12 and were required for direct interaction with IECs.This variant observed in LF82 was found in 35.5% of AIEC strains and 7.4% of non-AIEC strains in the same strain collection (Camprubı-Font et al., 2019).Moreover, we observed an over-expression of the chiA gene in AIEC strains compared to non-AIEC strains, which may lead to increased interaction with human chitinases and enhance bacterial adhesion.

Biomarkers for AIEC molecular detection
Several genetic elements have been proposed as putative molecular markers for AIEC (Dogan et al., 2014;Deshpande et al., 2015;Desilets et al., 2016;Vazeille et al., 2016;Ceśpedes et al., 2017;Camprubı-Font et al., 2018).Desilets et al. suggested a biomarker based on the amplification of three genomic regions with an accuracy of 85%, although it was only present in B2 AIEC strains (Desilets et al., 2016).Vazeille et al. revealed an amplification of two genes that could predict the phenotype with an accuracy of 83% but low sensitivity (31%) (Vazeille et al., 2016).Recently, a molecular marker based on chuA, eefC, and fitA genes has been proposed, with a prevalence of 55.6-77.8% in AIEC strains, in contrast to other E. coli strains (14.3-17.5%)(Saitz et al., 2022).However, this was not AIEC-specific, and further analyses with geographically diverse E. coli strains are needed.Camprubı-Font et al. presented an algorithm based on SNPs sequencing with an accuracy of 81%, but it was not universal for all AIEC strains (Camprubı-Font et al., 2018, 2020).In our study, binary logistical regression analysis revealed two genes whose quantification presents higher sensitivities than other molecular markers previously suggested.In particular, this method can predict the AIEC phenotype with an accuracy of 85.7% in our strain collection.This would represent a faster, more reproducible, and putatively automatable method for AIEC identification.Nonetheless, further studies, including geographically diverse AIEC strains, are needed to determine if these biomarkers are universal.Other E. coli pathotypes, such as ExPEC, must be included to ensure these are AIEC-specific.Apart from testing the specificity in vitro using external strain collections a further step will be to test the applicability of the biomarkers directly in clinical samples.Gene expression will certainly change in in vivo conditions; however, we hypothesize that if the genes are relevant for AIEC pathogenicity it is probable these genes will be also overexpressed in vivo.However, further studies are needed to evaluate the applicability and usefulness of these biomarkers in disease management.

Concluding remark
Our study has increased our knowledge of the molecular mechanisms involved in AIEC pathogenicity.Gene expression levels of AIEC and non-AIEC strains in both SN and A/I fractions have revealed genes and pathways that may be involved in the adhesion and invasion of AIEC to IECs.These results provide a foundation for future investigations of therapeutic targets against AIEC colonization.Furthermore, we have proposed two genes as biomarkers to identify AIEC strains based on quantifying their expression levels.Subsequent studies to validate the distribution of these biomarkers in a geographically diverse collection of AIEC strains and their usefulness in clinical samples are of great interest.

Strains of study, cell line and growth conditions
Two previously isolated and characterized AIEC/non-AIEC strain pairs were used in this study.These pairs correspond to strains with similar genome structure, gene content, identical PFGE patterns, and sequence types, but with a different AIEC phenotype (Martinez-Medina et al., 2009;Camprubı-Font et al., 2018).The AIEC07/ECG04 strain pair was isolated from the ileum of a control patient and belonged to the B1 phylogroup, while the AIEC17/ ECG28 strain pair was obtained from the colon of a CD patient and corresponded to the D phylogroup.

Infection of intestinal epithelial cells
T75 flasks were seeded with 2 × 10 7 I-407 cells and incubated for 20 h.Cells were washed twice with 10 mL of phosphate-buffered saline (PBS; Lonza, Switzerland), and a cell culture medium composed of EMEM (Lonza, Switzerland) with 10% heatdeactivated FBS (Gibco, USA) was added.Infection of cells was performed at a multiplicity of infection (MOI) of 100, with bacteria growing exponentially in LB broth (OD600 = 0.625).After 4 h of incubation at 37°C with 5% CO 2 , the supernatant (SN fraction) was collected, and the eukaryotic cells containing adhered and invading bacteria (A/I fraction) were washed twice with the cell culture medium and recovered with a scraper.Both fractions were immediately kept on ice and centrifuged at 4°C for 10 min at 3000 × g.The supernatant was discarded, and the pellet was washed with 0.5 mL of PBS.

mRNA extraction and purification
Both the SN and A/I fractions were processed equally.The TRIzol Max Bacterial Isolation kit with Max Bacterial Enhancement reagent (Invitrogen, USA) was used to obtain total RNA, following the manufacturer's instructions with the following modifications: an incubation of 15 min at room temperature was done after chloroform addition, and overnight precipitation at -20°C was performed after the addition of isopropanol.Subsequently, a DNase I treatment (Thermo Scientific, USA) was applied using 1 U of DNase for 5 μg of RNA.The samples obtained at this point were named 'non-purified' samples.
To begin RNA purification, mRNA containing a polyA tail, 18S and 28S eukaryote rRNAs were removed with the MICROBEnrich kit (Thermo Scientific, USA) using Oligo MagBeads.For all samples, at least 25 μg of total RNA in a maximum of 30 μl were treated with this kit.If the RNA concentration was insufficient, an additional precipitation step was carried out as recommended in the kit instructions.The Ribo-Zero Magnetic Gold kit (epidemiology) (Illumina, USA) was used on 2.5 μg of sample RNA to remove all rRNAs and a part of miRNA and tRNA, obtaining mainly purified prokaryotic mRNA.The samples obtained at this point were named 'purified' samples.
RNA quantification was done before and after each kit with a Nanodrop spectrophotometer and a Qubit Fluorometer with RNA HS Assay kit (Thermo Scientific, USA).Additionally, RNA quality was verified with a denaturing agarose gel and/or the Total Prokaryotic RNA program of the Agilent RNA 6000 Nano Chip Bioanalyzer (Agilent, USA).

RNA-seq and comparative transcriptomics
The TruSeq Stranded mRNA method (Illumina, USA) was used for cDNA synthesis.Two SN samples and one A/I sample were sequenced for each strain by Illumina Miseq.The sequencing depth varied according to the fraction: 10 million reads for samples mainly composed of bacteria (SN fraction) and 200 million reads for samples from infected cells (A/I fraction).
Sequence reads were analyzed using FastQC,(Andrews, 2010) trimmed, and mapped to the UM146 AIEC reference genome (NC_017632) with the TopHat (Trapnell et al., 2009).Transcripts were assembled using Cufflinks (Trapnell et al., 2012), merged and compared to UM146 by Cuffmerge (Trapnell et al., 2012).Finally, normalization and differential expression analysis were done using Cuffdiff (Trapnell et al., 2012).Gene expression levels were presented as fragments per kilobase of transcript per million mapped reads (FPKM).Four comparisons were performed, in which AIEC transcripts were compared with non-AIEC transcripts for each condition and each pair.Those genes with p ≤ 0.05 were considered differentially expressed genes (DEGs).Fold change was calculated for each DEG by dividing the FPKM value of non-AIEC by the FPKM value of AIEC.All the DEGs were classified into eight functional categories by BlastX and extensive bibliographic research.

Bacterial strains and RNA extraction for validation of differentially expressed genes
All DEGs identified in the comparative transcriptomics analysis were studied in a phylogenetically diverse E. coli strain collection to determine whether these expression differences were widely distributed among AIEC strains and could be used as molecular biomarkers for AIEC identification.The strain collection included 13 AIEC and 22 non-AIEC strains isolated from ileum and colon biopsies of patients with clinically confirmed inflammatory bowel disease and from control subjects (Martinez-Medina et al., 2009).The commensal strain K-12 C600 and the reference AIEC strain LF82 were added to the collection of tested strains.Overall, the strain collection was composed of strains from phylogroups A (16.2%), B1 (8.1%), B2 (59.4%),D (13.5%), and an atypical strain (2.7%) (Supplementary Table 6).The strains were phenotypically characterized, and all genomes had been previously sequenced (Martinez-Medina et al., 2020).
The infection and RNA extraction protocols were the same as those described for the strain pairs in the previous sections (Infection of intestinal epithelial cells and mRNA extraction and purification).However, the samples in this part of the analysis were not treated with the MICROBEnrich or Ribo-Zero Magnetic Gold kit, so non-purified RNA extracts were obtained from the SN and A/I fractions.The strain pairs used for RNA-seq were also included in this part of the analysis, so gene expression levels could be compared between purified and non-purified samples for the AIEC17, AIEC07, ECG28, and ECG04 strains.

RT-qPCR for gene expression quantification
The previously identified DEGs were quantified in the same sequenced samples to validate RNA-seq results, as well as in nonpurified strain pair extracts and a collection of AIEC and non-AIEC strains.
First, 1.5 μg of RNA was reverse transcribed using the High-Capacity cDNA Reverse Transcription Kit (Applied Biosystems, USA) and the RNase Inhibitor (Life Technologies, USA) in a final volume of 20 μL.Since the SN samples of the strain pairs were studied in duplicate, they were mixed equally before cDNA synthesis.
RT-qPCR was performed in a 96 × 96 microfluidic array IFC chip on a BioMark ™ system (Fluidigm, USA).The Fluidigm loading kit-10 IFCs (BMK-M10-96.96)was used following the manufacturer's instructions.Before amplification, a cycle of 10 min at 95°C followed by 16 cycles of 15 s at 95°C and 4 min at 60°C was performed.Three technical replicates were analyzed for each sample.
The relative transcript abundance (RTA) of mRNA for each DEG was calculated with the equation (Pfaffl, 2012):

RTA=
E ▵Ct (reference−sample) (target gene) E ▵Ct (reference−sample) (housekeeping gene gapA) The respective AIEC strain was applied as a reference in the study of the strain pairs.Since RTA was calculated for the non-AIEC strain using the respective AIEC as a reference in the strain pairs, negative log 2 RTA (values<1) indicate over-expression in AIEC, while positive log 2 RTA (>1 value) indicates underexpression in AIEC.In the strain collection, LF82 A/I sample was used as reference to calculate RTA values.Therefore, negative log 2 RTA a lower expression of the sample compared with LF82 A/I while positive log 2 RTA indicates a higher expression.

Statistical analysis of RT-qPCR data
The level of correlation between the log 2 fold change values obtained from the comparative transcriptomics of RNA-seq data and the log 2 RTA values obtained from Fluidigm analysis was determined using Spearman's correlation.The effect of mRNA purification on gene expression ratio in RNA extracts was studied by comparing log 2 RTA values of purified and non-purified samples of the strain pairs using a paired t-test.
To identify which DEGs were differentially expressed between AIEC and non-AIEC phenotypes, gene expression levels were compared between AIEC and non-AIEC strains of the whole strain collection using a Mann-Whitney U test for nonparametric data or a t-test for parametric data.In this analysis, the mean of log 2 RTA values was compared between the AIEC and the non-AIEC groups for each DEG.Previously, a Shapiro-Wilk test was performed to assess if the variables were normally distributed, and the Levene test was used to assess homoscedasticity.Gene expression levels were also compared between SN and A/I fractions using a paired t-test for each phenotype.A chi-squared test was used for each gene to compare the detection frequency between AIEC and non-AIEC strains.
To determine which DEGs could be used to predict the AIEC phenotype, a binary logistic regression analysis was performed for each DEG individually using the log 2 RTA values obtained.Before the analysis, all missing values were replaced.Samples with no amplification of the DEG and a Ct gapA < 18 were considered to have a correct RNA concentration but an undetectable gene expression.To give a value for these samples, the log 2 RTA was calculated by replacing the noamplified Ct with 25, which is the detection limit of Fluidigm.Samples with a Ct gapA > 18 were considered not concentrated enough to ensure accurate quantification of genes with low expression levels.In this case, the missing values were substituted with the mean value of log 2 RTA for each DEG according to pathotype to avoid bias.Only genes amplified in >60% of strains were analyzed, and those with high predictive potential for the AIEC phenotype were selected based on their sensitivity, specificity, and accuracy.
All statistical analyses were performed with either SPSS (IBM Corp. Released 2010.IBM SPSS Statistics for Windows, Version 19.0.Armonk, NY: IBM Corp.) or GraphPad (GraphPad Prism version 5.00 for Windows, GraphPad Software, La Jolla California USA).For all analyses, a p-value ≤ 0.05 was considered statistically significant. 2 .
FIGURE 1 Differentially expressed genes (DEGs) displayed in comparative transcriptomics between AIEC and non-AIEC strain pairs during I-407 infection (p-value< 0.050).Total fragments per kilobase of transcript per million mapped reads (FPKM) and log2 fold change are shown.Symbols point out genes found in two or more comparisons.Genes that presented an over-expression in AIEC are shown in black, while genes that showed an under-expression are shown in grey.(A) DEGs found in the supernatant (SN) fraction of the strain pairs corresponding to mRNA.(B) DEGs found in the fraction containing infected cells with adherent and/or intracellular bacteria (A/I) of the strain pairs, corresponding to mRNA.(C) DEGs found in the SN fraction of the strain pairs correspond to sRNA and/or tRNA.(D) DEGs found in the A/I fraction of the strain pairs that correspond to sRNA and/or tRNA.

FIGURE 4
FIGURE 4Percentage of strains amplified in a collection of clinical isolates for each DEG.Only those DEGs with statistically significant differences between AIEC and non-AIEC strains are shown.*p-value<0.05;**p-value<0.01.Genes that presented an over-expression in AIEC are shown in black, while genes that showed an under-expression are shown in grey.

FIGURE 5
FIGURE 5Percentage of genes corroborated for the different groups of samples performed with RT-qPCR compared with the initial results obtained by RNA-Seq.The number of DEGs is indicated on each occasion.

TABLE 1
Effect of RNA purification on gene expression quantification of the strain pairs.

TABLE 2
Values of log 2 RTA of the genes with a statistically significant expression between AIEC and non-AIEC phenotypes are displayed for a given fraction.
P-values ≤ 0.05 were considered statistically significant and are highlighted in bold.

TABLE 3
Values of log 2 RTA of the genes displaying statistically significant expression between SN and A/I fractions for a given phenotype.

TABLE 4
The binary logistic regression model for the putative molecular markers associated with the AIEC phenotype was calculated from the log 2 RTA values for each DEG.
TP, true positive; TN, true negative; FP, false positive; FN, false negative.Gene expression values were those obtained only in the SN fraction.The percentage and values of successfully classified strains are indicated.Only genes with an accuracy >75% are shown.FIGURE 6